A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories

Raffini, Daniel, Macori, Agnese, Angelini, Marco, Catarci, Tiziana

arXiv.org Artificial Intelligence

The paper explores gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp's character classifications and Freytag's narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases -- especially implicit ones -- in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach. In recent years, considerable attention has been paid to addressing the problem of bias in Large Language Models (LLMs). Despite ongoing efforts and improvements, LLMs still often do not adequately represent diversity and continue to propagate various forms of societal bias in their output [1] [2] [3]. The extensive use of LLMs for content creation and text generation makes this issue increasingly urgent. Regarding gender bias, studies have explored different aspects, such as the correlation between gender and occupation [4] [5], personas [6] [7], or the use of adjectives [8]. Many of these studies also compared LLMs' correlations with official social data on occupation and human perceptions [5] [9]. Methodologies for studying bias in LLMs can be divided into intrinsic and extrinsic approaches [10] [11]. The intrinsic approach includes embedding- and probability-based bias, while the extrinsic approach focuses on generation-based bias [12]. A recent study from UNESCO [13] provides a comprehensive application of various approaches by studying the association of gendered words, asking LLMs to complete sentences, and generating entire stories. There are different modes of gender bias and stereotype propagation, and it is important to evaluate the issue from various points of view.


Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Lyu, Hanjia, Luo, Jiebo, Kang, Jian, Koenecke, Allison

arXiv.org Artificial Intelligence

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose whom to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-source benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
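The regional term choice task described above can be sketched as a simple scoring loop: classify each model answer by which regional variant of a term it uses, then tally preferences across the batch. The term pairs and helper names below are illustrative assumptions, not the benchmark's actual data or code.

```python
# Sketch of scoring a regional term choice task: given a model's free-text
# answer, record whether it used the Mainland (Simplified) or Taiwan
# (Traditional) term for the described item. Term pairs are illustrative.
TERM_PAIRS = [
    # (item, Mainland China term, Taiwan term)
    ("potato", "土豆", "馬鈴薯"),
    ("software", "软件", "軟體"),
]

def score_answer(answer: str, mainland: str, taiwan: str) -> str:
    """Classify which regional variant an answer uses."""
    has_m, has_t = mainland in answer, taiwan in answer
    if has_m and not has_t:
        return "mainland"
    if has_t and not has_m:
        return "taiwan"
    return "ambiguous"

def regional_preference(answers):
    """Tally regional term choices over (answer, term_pair) records."""
    counts = {"mainland": 0, "taiwan": 0, "ambiguous": 0}
    for answer, (_, mainland, taiwan) in answers:
        counts[score_answer(answer, mainland, taiwan)] += 1
    return counts
```

Aggregating `regional_preference` over many items per model is what would reveal the kind of Simplified-versus-Traditional skew the abstract reports.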


From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling

Chen, Evan, Zhan, Run-Jun, Lin, Yan-Bai, Chen, Hung-Hsuan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases present in their training data. This study introduces a novel evaluation framework to uncover gender biases in LLMs, focusing on their occupational narratives. Unlike previous methods relying on structured scenarios or carefully crafted prompts, our approach leverages free-form storytelling to reveal biases embedded in the models. Systematic analyses show an overrepresentation of female characters across occupations in six widely used LLMs. Additionally, our findings reveal that LLM-generated occupational gender rankings align more closely with human stereotypes than actual labor statistics. These insights underscore the need for balanced mitigation strategies to ensure fairness while avoiding the reinforcement of new stereotypes.
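The ranking comparison underlying this finding can be illustrated with Spearman's rank correlation: if a model's occupational gender ranking correlates more strongly with a stereotype ranking than with labor statistics, outputs track perception rather than reality. All numbers below are invented for the sketch, not the study's data.

```python
# Toy sketch: compare an LLM's occupational "female-likelihood" ranking to a
# stereotype ranking and a labor-statistics ranking via Spearman's rho.

def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Ranks of five occupations, 1 = most strongly associated with women.
llm_rank        = [1, 2, 3, 4, 5]
stereotype_rank = [1, 2, 4, 3, 5]   # close to the LLM's ordering
labor_rank      = [2, 4, 1, 5, 3]   # further from the LLM's ordering

rho_stereotype = spearman_rho(llm_rank, stereotype_rank)
rho_labor = spearman_rho(llm_rank, labor_rank)
# A higher rho against the stereotype ranking than against labor statistics
# would mirror the alignment pattern the abstract describes.
```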


Exploring Gender Disparities in Automatic Speech Recognition Technology

ElGhazaly, Hend, Mirheidari, Bahman, Moosavi, Nafise Sadat, Christensen, Heidi

arXiv.org Artificial Intelligence

This study investigates factors influencing Automatic Speech Recognition (ASR) systems' fairness and performance across genders, beyond the conventional examination of demographics. Using the LibriSpeech dataset and the Whisper small model, we analyze how performance varies across different gender representations in training data. Our findings suggest a complex interplay between the gender ratio in training data and ASR performance. Optimal fairness occurs at specific gender distributions rather than a simple 50-50 split. Furthermore, our findings suggest that factors like pitch variability can significantly affect ASR accuracy. This research contributes to a deeper understanding of biases in ASR systems, highlighting the importance of carefully curated training data in mitigating gender bias.
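Measuring performance "across different gender representations" reduces, at evaluation time, to computing word error rate (WER) per speaker group. A minimal sketch, with invented transcripts and a plain word-level Levenshtein distance (the study's actual pipeline is not specified here):

```python
# Per-group ASR evaluation sketch: WER computed per speaker group, as one
# might do when varying the gender ratio of the training data.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

def wer_by_group(samples):
    """Average WER per group from (group, reference, hypothesis) records."""
    totals = {}
    for group, ref, hyp in samples:
        totals.setdefault(group, []).append(wer(ref, hyp))
    return {g: sum(v) / len(v) for g, v in totals.items()}
```

Plotting `wer_by_group` against the training-data gender ratio is the kind of analysis that surfaces the "optimal fairness at specific distributions" result.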


INFELM: In-depth Fairness Evaluation of Large Text-To-Image Models

Jin, Di, Liu, Xing, Liu, Yu, Yap, Jia Qing, Wong, Andrea, Crespo, Adriana, Lin, Qi, Yin, Zhiyuan, Yan, Qiang, Ye, Ryan

arXiv.org Artificial Intelligence

The rapid development of large language models (LLMs) and large vision models (LVMs) has propelled the evolution of multi-modal AI systems, which have demonstrated remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on the comprehensiveness of various aspects of the models, but they exhibit critical limitations, including insufficient attention to content generation alignment and social bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation of widely used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that existing models in the study generally do not meet the empirical fairness criteria, and representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles.


Evaluating Gender Bias in Large Language Models

Döll, Michael, Döhring, Markus, Müller, Andreas

arXiv.org Artificial Intelligence

Gender bias in artificial intelligence has become an important issue, particularly in the context of language models used in communication-oriented applications. This study examines the extent to which Large Language Models (LLMs) exhibit gender bias in pronoun selection in occupational contexts. The analysis evaluates the models GPT-4, GPT-4o, PaLM 2 Text Bison and Gemini 1.0 Pro using a self-generated dataset. The jobs considered include a range of occupations, from those with a significant male presence to those with a notable female concentration, as well as jobs with a relatively equal gender distribution. Three different sentence processing methods were used to assess potential gender bias: masked tokens, unmasked sentences, and sentence completion. In addition, the LLMs suggested names of individuals in specific occupations, which were then examined for gender distribution. The results show a positive correlation between the models' pronoun choices and the gender distribution present in U.S. labor force data. Female pronouns were more often associated with female-dominated occupations, while male pronouns were more often associated with male-dominated occupations. Sentence completion showed the strongest correlation with actual gender distribution, while name generation resulted in a more balanced 'politically correct' gender distribution, albeit with notable variations in predominantly male or female occupations. Overall, the prompting method had a greater impact on gender distribution than the model selection itself, highlighting the complexity of addressing gender bias in LLMs. The findings highlight the importance of prompting in gender mapping.
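The headline result, a positive correlation between pronoun choices and U.S. labor-force gender distribution, boils down to a Pearson correlation between two per-occupation series. A sketch with made-up figures (not the study's dataset):

```python
# Illustrative check of the kind of correlation reported here: female-pronoun
# rate per occupation from a model vs. the female share of that occupation in
# labor-force data. All figures below are invented for the sketch.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-occupation values (e.g. nurse ... mechanic).
model_female_rate  = [0.85, 0.70, 0.40, 0.20, 0.10]
labor_female_share = [0.88, 0.65, 0.48, 0.15, 0.05]

r = pearson(model_female_rate, labor_female_share)
# A strongly positive r reproduces the pattern the study describes; running
# the same check per prompting method would expose the differences the
# authors found between masked tokens, full sentences, and completion.
```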


GOSt-MT: A Knowledge Graph for Occupation-related Gender Biases in Machine Translation

Mastromichalakis, Orfeas Menis, Filandrianos, Giorgos, Tsouparopoulou, Eva, Parsanoglou, Dimitris, Symeonaki, Maria, Stamou, Giorgos

arXiv.org Artificial Intelligence

Gender bias in machine translation (MT) systems poses significant challenges that often result in the reinforcement of harmful stereotypes. Especially in the labour domain, where occupations are frequently and inaccurately associated with specific genders, such biases perpetuate traditional gender stereotypes with a significant impact on society. Addressing these issues is crucial for ensuring equitable and accurate MT systems. This paper introduces a novel approach to studying occupation-related gender bias through the creation of the GOSt-MT (Gender and Occupation Statistics for Machine Translation) Knowledge Graph. GOSt-MT integrates comprehensive gender statistics from real-world labour data and textual corpora used in MT training. This Knowledge Graph allows for a detailed analysis of gender bias across English, French, and Greek, facilitating the identification of persistent stereotypes and areas requiring intervention. By providing a structured framework for understanding how occupations are gendered in both labour markets and MT systems, GOSt-MT contributes to efforts aimed at making MT systems more equitable and reducing gender biases in automated translations.


Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective

Restrepo, David, Wu, Chenwei, Vásquez-Venegas, Constanza, Matos, João, Gallifant, Jack, Filipe, Luis

arXiv.org Artificial Intelligence

The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, covering data from January 1, 2021, to June 16, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini impurity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
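A diversity index "based on Gini impurity" has a compact definition: one minus the sum of squared class proportions, zero for a homogeneous group and approaching one as labels spread evenly. The sketch below reads it as applied to author-affiliation countries, which is an assumption for illustration, not the paper's exact construction.

```python
# Minimal Gini-impurity diversity index over categorical labels.
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions.
    0 means every item shares one label; values near 1 mean high diversity."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Author-affiliation countries for two hypothetical journals.
journal_a = ["US", "US", "US", "US"]   # homogeneous -> index 0.0
journal_b = ["US", "BR", "IN", "KE"]   # diverse -> index 0.75
```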


Using Artificial French Data to Understand the Emergence of Gender Bias in Transformer Language Models

Conti, Lina, Wisniewski, Guillaume

arXiv.org Artificial Intelligence

Numerous studies have demonstrated the ability of neural language models to learn various linguistic properties without direct supervision. This work takes an initial step towards exploring the less researched topic of how neural models discover linguistic properties of words, such as gender, as well as the rules governing their usage. We propose to use an artificial corpus generated by a PCFG based on French to precisely control the gender distribution in the training data and determine under which conditions a model correctly captures gender information or, on the contrary, appears gender-biased.
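The key affordance of a PCFG-generated corpus is that rule probabilities fix the gender distribution of the training data exactly. A toy version of that idea, with an invented French-like grammar and an illustrative 0.7/0.3 masculine/feminine split (not the paper's grammar):

```python
# Toy PCFG sampler: rule weights act as a knob for the corpus gender ratio.
import random

PCFG = {
    "S":    [(["NP_m", "V"], 0.7), (["NP_f", "V"], 0.3)],  # gender ratio knob
    "NP_m": [(["le", "N_m"], 1.0)],
    "NP_f": [(["la", "N_f"], 1.0)],
    "N_m":  [(["chat"], 0.5), (["livre"], 0.5)],
    "N_f":  [(["table"], 0.5), (["maison"], 0.5)],
    "V":    [(["tombe"], 1.0)],
}

def sample(symbol, rng):
    """Recursively expand a symbol according to the rule probabilities."""
    if symbol not in PCFG:               # terminal word
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    rhs = rng.choices(rules, weights=weights)[0]
    return [word for sym in rhs for word in sample(sym, rng)]

rng = random.Random(0)
sentences = [" ".join(sample("S", rng)) for _ in range(1000)]
feminine_share = sum(s.startswith("la ") for s in sentences) / len(sentences)
# feminine_share lands near the 0.3 set in the grammar, so a probe of the
# trained model's gender behavior can be compared against a known ratio.
```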


As Jobs Are Automated, Will Men and Women Be Affected Equally?

#artificialintelligence

I am writing this article while my baby daughter sleeps. Like all new parents, her dad and I have spent the last few months in a joy-filled, sleepy haze of getting to know her and imagining what her future might look like. This brings a new intensity, and a little more trepidation, to my role advising on the future of work. What will work look like for this generation of young women, especially as more and more of our roles are being automated -- or even replaced -- by artificial intelligence (AI)? And how can leaders ensure that AI does not lead to gender bias in their organizations? Recent research is beginning to answer these questions, and the outlook is mixed: on the one hand, women may be spared from the job disruptions men will face in the longer term.